arXiv:1504.07459v1 [cs.CL] 28 Apr 2015


CommentWatcher: An Open Source Web-based platform 
for analyzing discussions on web forums 


Marian-Andrei Rizoiu 
ERIC Lab, University Lyon2 
Lyon, France 
Marian-Andrei.Rizoiu@univ-lyon2.fr


Adrien Guille 
ERIC Lab, University Lyon2
Lyon, France
Adrien.Guille@univ-lyon2.fr


Julien Velcin
ERIC Lab, University Lyon2
Lyon, France
Julien.Velcin@univ-lyon2.fr


ABSTRACT 

We present CommentWatcher, an open source tool aimed at
analyzing discussions on web forums. Constructed as a web
platform, CommentWatcher features automatic mass fetching
of user posts from forums on multiple sites, extracting topics,
visualizing the topics as an expression cloud and exploring
their temporal evolution. The underlying social network of
users is simultaneously constructed using the citation
relations between users and visualized as a graph structure.
Our platform addresses the diversity and dynamics of the
structures of the webpages hosting the forums by implementing
a parser architecture that is independent of the HTML
structure of webpages. This allows new websites to be added
easily, on the fly. Two types of users are targeted:
end users who seek to study the discussed topics and their
temporal evolution, and researchers who need to establish a
forum benchmark dataset and compare the performance
of analysis tools.

Categories and Subject Descriptors 

H.3.5 [Information Storage and Retrieval]: Online Information
Services - Web-based services; I.2.7 [Artificial Intelligence]:
Natural Language Processing - Language parsing and understanding,
Text analysis; H.3.3 [Information Storage and Retrieval]:
Information Search and Retrieval - Clustering, Selection process

1. INTRODUCTION

Web 2.0 has changed the way users discuss with one another.
Web forums are among the preferred online discussion
environments. Users can react, post their opinions, discuss
and debate any kind of subject. Forums are usually
thematic (e.g. Java programming forums) and new users
have access to past discussions (e.g. solutions posted by
other users to a specific problem). Users therefore become
full collaborative participants in the information creation
process. The subjects of discussion between readers
are very dynamic, and the overall sum of reactions gives a
snapshot of the general trends that emerge in the user
population. At the same time, the way users reply to one another
suggests an underlying social network structure. The forum's
"reply-to" structural relations can be used to add links
between users. Other types of relations can be added, such as
name and textual citations. Furthermore, based on
such social networks constructed from web forums, adapted
graph measures can be used to detect user social roles.

1.1 Current limitations 

Forum data remain under-explored, even though they represent
an important source of knowledge. News article analysis
and microblogging (e.g. Twitter) analysis receive a lot
of attention from the community. Tools are available
that analyze news media, but they do not
treat the social network aspect. Other tools concentrate
on analyzing and visualizing social dynamics or on detecting
events based on Twitter data. To the best of our
knowledge, no publicly available tool treats
forums while inferring a social network structure.

Another limitation concerns forum benchmarks. There
is a multitude of general purpose information retrieval
datasets (e.g. the ClueWeb12 dataset^1 of the Lemur project) and
of Twitter datasets (e.g. the infochimps collection^2).
But dedicated web forum benchmark datasets are scarce.
Those that exist usually originate from a single forum
website (e.g. the boards.ie Forums Dataset^3 based on the boards.ie
website or the Ancestry.com Forum Dataset^4 based on the
ancestry.com website). This is due to the diverse and
ever-changing structure of the websites hosting the discussions
and to copyright problems. Each host website has its own
license on the user-produced data, which is not always clearly
stipulated. This leads researchers to develop their own
in-house parsers and create their own datasets. These datasets
are rarely shared with the community, which poses problems
when testing new proposals and comparing them to existing
approaches.

^1 http://lemurproject.org/clueweb12/specs.php
^2 http://www.infochimps.com/collections/twitter-census
^3 http://www.icwsm.org/2012/submitting/datasets/
^4 http://www.cs.cmu.edu/~jelsas/data/ancestry.com/



1.2 Introducing CommentWatcher

We address these issues by introducing CommentWatcher,
an open source web-based platform for analyzing discussions
on web forums. CommentWatcher was designed with two types
of users in mind: the forum analyst, who
seeks to understand the main topics of discussion and the
social interactions between users, and the researcher, who
needs a benchmark to test his/her proposed approaches.
Using CommentWatcher, the researcher can create forum
discussion benchmarks without worrying about copyright issues,
since the platform is open source and the text itself is not
distributed (each researcher can locally recreate the
benchmark dataset).

When building CommentWatcher, we address the challenges
that arise from retrieving forums from multiple web sources.
Not only are these sources profoundly heterogeneous in
structure, but they also tend to change often, which renders
parsers obsolete. We implement a parser architecture that is
independent of the website structure and allows new sources
to be added and existing ones to be updated on the fly.
CommentWatcher also supports mass fetching of forums from
supported sources by using keyword search on the internet,
extracting discussion topics, creating the underlying social
network structure of users and visualizing it in relation to
the extracted topics.

During the demonstration, the participants will be able
to interact with CommentWatcher in a normal browser window,
through the tool's web interface. The tool itself will
be hosted and executed on its dedicated machine, located
at the ERIC laboratory. The tool's capabilities will be
illustrated by showing the participants, live, how to (a) fetch
multiple discussion forums by searching the web using
keywords, (b) apply topic extraction algorithms and tweak
their parameters, (c) visualize the extracted topics as an
expression cloud, together with their temporal evolution, and
(d) visualize the social network constructed from the fetched
forums.


2. PLATFORM DESIGN 

In this section, we describe the software technologies used 
in developing CommentWatcher, the general architecture and 
the different components to highlight their aim and the way 
they interact. 

2.1 Software technologies 

CommentWatcher is written using Java Servlets for server-side
computing and JavaServer Pages for dynamic webpage
generation. The support for fetching forum discussions
from websites is implemented using the XSL Transformation
(XSLT) technology. New websites can be added dynamically,
without changing the source code. A MySQL database
is used for storing forum structure, user characteristics and
the text. The visualization is performed client-side, in a
Java applet.

2.2 Platform architecture 

The application has three main modules, interconnected
as shown in Figure 1. The fetching module deals with
downloading the forums, parsing the web pages and storing the
data into the database. Optionally, it can perform a keyword
web search to find forums that can be fetched. The
topic extraction module performs topic extraction on a
selection of forums, using an algorithm implemented as a library.
The visualization module has two views: (i) topic
visualization as an expression cloud and as a temporal evolution
graphic and (ii) social network visualization.



Figure 1: CommentWatcher: overview of the platform’s 
architecture. 


2.3 The fetching module 

This module deals with downloading, parsing and importing
the forum data into the application. The main difficulty
when parsing web pages is that the structure of each page is
different. What is more, the structure of a given web page
tends to change over time. With CommentWatcher we have
designed and implemented a meta-parser, which is independent
of the website. The actual adaptation of the parser
to a specific page is done using an external definition file,
implemented in XSLT, a standardized and well documented
language. Therefore, adding support for new websites or
modifying existing ones boils down to adding or modifying
definition files, without any change in the parser's
source code.

The design schema of the fetching module, as well as its
interactions with the user interface and the database, is
given in Figure 2(a). The download action specifies the URL
of a forum to be downloaded. The bulk download follows the
same idea, but a keyword web search is performed using the
Bing API and all results from supported websites are
downloaded. A screenshot of the keyword web search and mass
fetching is given in Figure 2(b). The specified page is
downloaded in raw HTML format and then undergoes cleaning,
XSL transformation and deserialization. Cleaning
transforms the HTML document into well-formed XML.
In the following step, the XSL transformation
is applied to the well-formed XML document using one
of the XSLT definition files of the supported websites. The
result of the transformation is an XML document that
uses the same XML schema for all supported websites. The
required data is then deserialized into Java objects, which
can subsequently be stored in and retrieved from the database.
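The idea of a site-independent parser driven by external definitions can be illustrated with a minimal Python sketch. It is not the platform's implementation, which is written in Java and uses XSLT definition files. Python's standard library has no XSLT processor, so the definition files are replaced here by a small dictionary of element paths; the site names, the page markup and the `Post` fields are all hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List


@dataclass
class Post:
    """Common record produced for every supported site (hypothetical fields)."""
    author: str
    date: str
    text: str


# Hypothetical per-site definitions. They stand in for the XSLT definition
# files: each one maps a site's own structure onto the same set of fields.
SITE_DEFINITIONS = {
    "site-a": {"post": ".//div[@class='comment']",
               "author": "span[@class='user']",
               "date": "span[@class='when']",
               "text": "p"},
    "site-b": {"post": ".//li", "author": "b", "date": "i", "text": "q"},
}


def parse_forum(site: str, clean_xml: str) -> List[Post]:
    """Generic parser: nothing site-specific is hard-coded in this function."""
    definition = SITE_DEFINITIONS[site]
    root = ET.fromstring(clean_xml)  # input must already be well-formed XML
    posts = []
    for node in root.findall(definition["post"]):
        posts.append(Post(
            author=node.findtext(definition["author"], default="").strip(),
            date=node.findtext(definition["date"], default="").strip(),
            text=node.findtext(definition["text"], default="").strip(),
        ))
    return posts


# Two already-cleaned pages with different structures (made-up content).
PAGE_A = ("<html><body><div class='comment'><span class='user'>alice</span>"
          "<span class='when'>2013-01-05</span><p>Use a HashMap.</p></div>"
          "</body></html>")
PAGE_B = "<ul><li><b>bob</b><i>2013-01-06</i><q>Thanks, that worked.</q></li></ul>"

posts_a = parse_forum("site-a", PAGE_A)
posts_b = parse_forum("site-b", PAGE_B)
```

In this sketch, supporting a new site means adding one entry to `SITE_DEFINITIONS`; `parse_forum` itself stays unchanged, which mirrors the role the definition files play for the meta-parser.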

The advantages of implementing such a parsing process 
are that it is simple, reliable, easy to understand and modify. 
Furthermore, it does not hard-code the website’s structure 
and it allows adding new supported websites on-the-fly. 

2.4 Topic extraction and textual classification 

This module allows extracting topics from the texts of a
selection of forums already fetched into the database. The
design is modular, the extraction itself being performed by
external libraries. The text from the selected forums is prepared
and packaged in the format required by the topic extraction
library and then passed to the library. The user interface
allows setting the parameters for each library. Once the
extraction is finished, the results are saved into an XML
document, which has the same format for all topic extraction
libraries. The XML document contains the expressions
associated to each topic and their scores.

Figure 2: The design of the fetching module (a) and a screenshot of the keyword mass fetching process (b)

At present, CommentWatcher supports two topic extraction
algorithms, provided by two libraries: Topical N-Grams,
provided by the Mallet Toolkit library, and
CKP, provided by the CKP library. Topical N-Grams is
a graphical model algorithm, which models topics as
probability distributions over n-grams. CKP uses overlapping
textual clustering (one text can belong to multiple clusters)
and considers each cluster of the partition as a topic. The
expressions stored in the XML result document are either
(i) the resulting n-grams (for Topical N-Grams) or (ii) the
frequent expressions (for CKP). Their score is (i) the
probability with which an n-gram is associated to a topic (for
Topical N-Grams) or (ii) $1 - d(e_i, \mu)$, where $d(e_i, \mu)$
is the normalized distance between the frequent expression $e_i$
and the topic's centroid $\mu$ (for CKP). Support for new
algorithms and libraries can be added easily, but it requires
writing adapters for the inputs and outputs.
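As an illustration of the CKP-style score and of a uniform result document, the following Python sketch computes $1 - d(e_i, \mu)$ for a few expressions and serializes the scores to XML. The text does not specify the distance or its normalization, so Euclidean distance divided by the largest distance in the topic is an assumption made here. The element and attribute names of the XML document, the vectors and the topic id are hypothetical as well.

```python
import math
import xml.etree.ElementTree as ET


def distance_based_scores(expressions, centroid):
    """Score each expression as 1 - d(e_i, mu), with d scaled to [0, 1].

    Assumption: d is the Euclidean distance divided by the largest
    distance observed in the topic; the source does not fix this choice.
    """
    distances = {name: math.dist(vector, centroid)
                 for name, vector in expressions.items()}
    largest = max(distances.values()) or 1.0  # all-zero distances: avoid /0
    return {name: 1.0 - dist / largest for name, dist in distances.items()}


def topics_to_xml(topics):
    """Serialize {topic_id: {expression: score}} to a hypothetical XML format."""
    root = ET.Element("topics")
    for topic_id, scores in topics.items():
        topic = ET.SubElement(root, "topic", id=str(topic_id))
        # Highest-scoring expressions first.
        for expression, score in sorted(scores.items(), key=lambda kv: -kv[1]):
            item = ET.SubElement(topic, "expression", score=f"{score:.3f}")
            item.text = expression
    return ET.tostring(root, encoding="unicode")


# Made-up two-dimensional vectors for two frequent expressions.
scores = distance_based_scores(
    {"garbage collector": (1.0, 0.0), "memory leak": (0.0, 1.0)},
    centroid=(1.0, 0.0),
)
result_xml = topics_to_xml({5: scores})
```

An adapter for another library would only need to produce the same `{topic_id: {expression: score}}` mapping, for example with n-gram probabilities as scores, to reuse the same serialization.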

2.5 Visualization 

The visualization module is designed to help the user
quickly understand the extracted topics and visualize their
temporal evolution. It is the only module that is executed
client-side, in a Java applet. After the XML object resulting
from the topic extraction is loaded by the applet, two
visualizations are available: the expression cloud and the temporal
evolution graphic. Figure 3 shows a screenshot with the two
visualizations. The expression cloud visualization is similar
to the word cloud visualization, with the exception that it
uses the expressions generated by the topic extraction module,
and their sizes are proportional to their scores. The
temporal evolution graphic portrays the popularity of each
topic over a period of time. Time is discretized into a
configurable number of intervals, the user posts associated
to each topic in each interval are counted and graphics are
generated for each forum or for each hosting website.
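The counting step behind the temporal evolution graphic can be sketched as follows. This is a minimal Python illustration rather than the applet's code; it assumes equal-width intervals spanning the first to the last post, and the function name, the input format and the sample dates are hypothetical.

```python
from datetime import datetime


def topic_popularity(posts, n_intervals):
    """Count posts per topic in each of n_intervals equal-width time bins.

    posts: iterable of (timestamp, topic_id) pairs.
    Returns {topic_id: [count in bin 0, count in bin 1, ...]}.
    """
    posts = list(posts)
    start = min(ts for ts, _ in posts)
    end = max(ts for ts, _ in posts)
    span = (end - start).total_seconds() or 1.0  # single instant: avoid /0
    counts = {}
    for ts, topic in posts:
        position = (ts - start).total_seconds() / span  # in [0, 1]
        # The very last post would land on index n_intervals; clamp it
        # into the final bin.
        index = min(int(position * n_intervals), n_intervals - 1)
        counts.setdefault(topic, [0] * n_intervals)[index] += 1
    return counts


# Made-up posts: (date, topic id).
sample_posts = [
    (datetime(2013, 1, 1), 1),
    (datetime(2013, 1, 2), 1),
    (datetime(2013, 1, 9), 2),
    (datetime(2013, 1, 11), 1),
]
popularity = topic_popularity(sample_posts, n_intervals=2)
```

Running the same function on the posts of a single forum or on all posts from one hosting website would give the two granularities mentioned above.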



Figure 3: The expression cloud visualization of topics
and their temporal evolution.


Social network visualization. 

To facilitate the exploration of the interactions between
the members of the forum, we compute a visualization of
the underlying social network. The network is colored
according to the topics on which the users are interacting. We
construct the social network as a labeled multidigraph.
We map the network nodes to the authors of
messages. We add an arc labeled with the topic between
two nodes when there is, between the two users, at least
one direct reply belonging to the respective topic. We
further enrich the network with user features such as the number
of posts, the number of topics a user participates in, the
number of threads a user participates in, etc. Further
measures are calculated on the graph, such as the weighted
in- and out-degree, the betweenness centrality and the closeness
centrality.
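The construction of the labeled multidigraph can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the platform's code: arcs are keyed by (source, target, topic), and the arc weight is taken to be the number of direct replies, which the text does not specify. The function names and the sample replies are hypothetical.

```python
from collections import defaultdict


def build_reply_network(replies):
    """Build a topic-labeled multidigraph from direct replies.

    replies: iterable of (author, replied_to, topic_id), one per reply.
    Returns {(source, target, topic): weight}; two users can be linked by
    several arcs, one per topic. Weight = number of replies (assumption).
    """
    arcs = defaultdict(int)
    for author, replied_to, topic in replies:
        arcs[(author, replied_to, topic)] += 1
    return dict(arcs)


def weighted_degrees(arcs):
    """Return (in_degree, out_degree) dictionaries summing arc weights."""
    in_degree, out_degree = defaultdict(int), defaultdict(int)
    for (source, target, _topic), weight in arcs.items():
        out_degree[source] += weight
        in_degree[target] += weight
    return dict(in_degree), dict(out_degree)


def filter_by_topic(arcs, topic):
    """Keep only the arcs labeled with the given topic."""
    return {key: weight for key, weight in arcs.items() if key[2] == topic}


# Made-up replies: (author, user replied to, topic id).
sample_replies = [
    ("Robert", "David", 5),
    ("Robert", "David", 5),
    ("Robert", "David", 2),
    ("David", "Robert", 5),
]
network = build_reply_network(sample_replies)
in_degree, out_degree = weighted_degrees(network)
topic_5_network = filter_by_topic(network, 5)
```

Filtering by topic, as in `filter_by_topic`, corresponds to showing only the part of the network that relates to a chosen topic.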

Figure 4 shows how CommentWatcher displays the induced
social network. The visualization is created with the JUNG
graph library^5 and is interactive, so nodes can be selected
in order to see their features. Relations can also be filtered
in order to show only the network corresponding to certain
topics.

^5 http://jung.sourceforge.net

Figure 4: Visualizing the constructed social network,
enriched with topical and user features. One
can see, inter alia, that the reply of "Robert" to
"David VIETI" is associated to topic #5.


3. LICENSE AND SOURCE CODE 

CommentWatcher is released under the open source GNU GPL
license^6. The individual topic extraction and textual
clustering software packages are subject to their respective
licenses. The present version of CommentWatcher comes
with two Natural Language Processing toolkits: the Mallet
Toolkit v2.0.7, released under the open source Common
Public License, and CKP v0.2, released under the GNU
GPL v3. The install files and the source code of
CommentWatcher are available through a public Mercurial
repository^7.

4. RELATED WORKS 

Several tools intended to extract knowledge from on-line
discussions have been proposed in recent years.

MAQSA is a system for social analytics on news that
allows its users to define their own topic of interest, in
order to gather related articles, identify related topics, and
extract the time-line and network of comments that show
who commented on which article and when.

Eddi offers visualizations such as time-lines and tag
clouds of topics extracted from tweets using a simple topic
detection algorithm that uses a search engine as an external
knowledge base.

OpinionCrawl^8 is an on-line service that crawls various
web sources - such as blogs, news, forums and Twitter -
searching for a user-defined topic and then presents key
concepts as a tag cloud, provides a visualization of the temporal
dynamics of the topic and performs a sentiment analysis.

SONDY is an open-source platform for analyzing on-line
social network data. It features a data import and
pre-processing service, a topic detection and trends analysis
service, as well as a service for the interactive exploration of
the corresponding networks (i.e., active authors for the
considered topic(s)).

The aforementioned tools are limited for various reasons.
They are either proprietary software, and thus cannot be
extended for scientific purposes, or they cannot directly crawl
web sources and can only be used to analyze formatted datasets
provided by the user. CommentWatcher intends to provide
researchers with an open-source, extensible tool that makes it
possible to crawl the web and build datasets that suit their needs.

^6 http://www.gnu.org/licenses/
^7 http://eric.univ-lyon2.fr/~commentwatcher/cgi-bin/CommentWatcher.cgi/CommentWatcher/
^8 http://opinioncrawl.com


5. CONCLUSION AND FUTURE WORKS 

In this paper we have presented CommentWatcher, an open
source web-based platform for analyzing discussions on web
forums. Our tool is designed both for end users and
for researchers. End users have at their disposal an
easy-to-use, integrated tool that retrieves forum discussions
from multiple websites, performs topic extraction to
identify the main discussion topics and provides an expression
cloud visualization to identify the most important
expressions associated to each topic. The temporal popularity of
topics can be evaluated using an evolution graphic.
CommentWatcher also extracts the underlying social
network by using the direct citation links between users.
The visualization of the social network is interactive: the
features of nodes can be inspected and relations can be
filtered to show only the network corresponding to a certain
topic. For researchers, CommentWatcher tackles the problem
of creating multi-source web forum datasets, thanks to its
versatile parser, which is independent of the structure of
webpages. Support for new websites can be added on the fly. It
can also solve the problem of copyright when sharing forum
datasets, since no text is distributed and each researcher can
easily recreate the dataset. As future work, we intend to add
a credential mechanism and transform CommentWatcher into
a multiuser tool. We are also considering implementing topic
evaluation based on ontologies of concepts and a better plotting
of the social network by using force-directed graph drawing.